In this paper we propose a general series method to estimate a semiparametric partially linear varying coefficient model. We establish the consistency and $\sqrt{n}$-normality property of the estimator of the finite-dimensional parameters of the model. We further show that, when the error is conditionally homoskedastic, this estimator is semiparametrically efficient in the sense that the inverse of the asymptotic variance of the estimator of the finite-dimensional parameter reaches the semiparametric efficiency bound of this model. A small-scale simulation is reported to examine the finite sample performance of the proposed estimator, and an empirical application is presented to illustrate the usefulness of the proposed method in practice. We also discuss how to obtain an efficient estimation result when the error is conditionally heteroskedastic.



1. Introduction. Semiparametric and nonparametric estimation techniques have attracted much attention among statisticians and econometricians. One popular semiparametric specification is the partially linear model considered by Robinson (1988), Speckman (1988) and Stock (1989), among others:

(1) $Y_i = v_i'\gamma + \delta(z_i) + u_i, \qquad i = 1, \ldots, n,$

where the prime denotes transpose, $v_i'\gamma$ is the parametric component and $\delta(z_i)$ is an unknown function and, therefore, is the nonparametric component of the model; see Green and Silverman (1994), Härdle, Liang and Gao (2000) and the references therein for more detailed discussion of this model. This model can be generalized to the following semiparametric varying coefficient model:

(2) $Y_i = v_i'\gamma(z_i) + \delta(z_i) + u_i, \qquad i = 1, \ldots, n,$

where $\gamma(z)$ is a vector of unknown smooth functions of $z$. Define $x_i = (1, v_i')'$ and $\beta(z) = (\delta(z), \gamma(z)')'$. Then (2) can be written more compactly as

(3) $Y_i = x_i'\beta(z_i) + u_i, \qquad i = 1, \ldots, n.$

The varying coefficient model is an appropriate setting, for example, in the framework of a cross-sectional production function where $v_i = (\mathrm{Labor}_i, \mathrm{Capital}_i)'$ represents the firm's labor and capital inputs, and $z_i = \mathrm{R\&D}_i$ is the firm's research and development expenditure. The varying coefficient model suggests that the labor and capital input coefficients may vary directly with the firm's R&D input, so the marginal productivities of labor and capital depend on the firm's R&D values. By contrast, the partially linear model (1) only allows the R&D variable to have a neutral effect on the production function; that is, R&D only shifts the level of the production frontier and does not affect the marginal productivity of labor and/or capital. Li, Huang, Li and Fu (2002) use the nonparametric kernel method to estimate the semiparametric varying coefficient model (2) and apply the method to China's nonmetal mineral manufacturing industry data; their results show that the semiparametric varying coefficient model (2) is more appropriate than either a parametric linear model or a semiparametric partially linear model for studying the production efficiency in China's nonmetal mineral manufacturing industry.

The time-series smooth transition autoregressive (STAR) model is another example of the varying coefficient model. It is given by $y_t = x_t'\beta(y_{t-d}) + u_t$, where $\beta(y_{t-d})$ is a vector of bounded functions; see Chen and Tsay (1993) and Hastie and Tibshirani (1993). They consider an autoregressive model of the form $y_t = f_1(y_{t-d})y_{t-1} + f_2(y_{t-d})y_{t-2} + \cdots + f_p(y_{t-d})y_{t-p} + u_t$, where the functional forms of the $f_j(\cdot)$'s ($j = 1, \ldots, p$) are not specified. Chen and Tsay (1993) and Hastie and Tibshirani (1993) discuss the identification of $f_j(\cdot)$ and suggest some recursive algorithms to estimate the unknown function $f_j(\cdot)$. More recent work on varying coefficient models can be found in Carroll, Fan, Gijbels and Wand (1997) and Fan and Zhang (1999), who propose a two-step procedure to accommodate varying degrees of smoothness among coefficient functions. See also Hoover, Rice, Wu and Yang (1998), Xia and Li (1999), Cai, Fan and Yao (2000), Cai, Fan and Li (2000), Fan and Huang (2002) and Zhang, Lee and Song (2002) on efficient estimation and inference of semiparametric varying coefficient models by using the local polynomial method, and Fan, Yao and Cai (2003) on adaptive estimation of varying coefficient models.

The semiparametric varying coefficient model has the advantage that it allows more flexibility in functional forms than a parametric linear model or a semiparametric partially linear model, and, at the same time, it avoids much of the "curse of dimensionality" problem, as the nonparametric functions are restricted only to part of the variables, $z$. However, when some of the $\beta$ coefficients are indeed constants, one should model them as constants and, in this way, one can obtain more efficient estimation results by incorporating this information. Consider again the production function example: if one further separates the capital into liquid capital and fixed capital, it is likely that the level of R&D will affect the marginal productivity of fixed capital, but not that of liquid capital. This gives rise to a partially linear varying coefficient model as follows:

(4) $Y_i = w_i'\gamma + x_i'\beta(z_i) + u_i, \qquad i = 1, \ldots, n,$

where $w_i$ is a vector of variables whose coefficient $\gamma$ is a vector of constant parameters; for instance, $w_i$ is the firm's liquid capital in the above production example.

In this paper we propose to estimate the partially linear varying coefficient model (4) using the general series method, such as spline or power series. We show that the series method leads to efficient estimation for the finite-dimensional parameter $\gamma$ under the conditional homoskedastic error condition. Recently, Fan and Huang (2002) suggested using the kernel-based profile likelihood approach to estimate a partially varying coefficient model [this paper was brought to our attention after the first submission of our paper], and they show that their approach also leads to efficient estimation of the finite-dimensional parameter $\gamma$ when the error is conditionally homoskedastic. In this paper we also argue that the efficient estimation result of the series-based method can be extended to the conditionally heteroskedastic error case in a straightforward way. It is more difficult to obtain efficient estimation results using the kernel-based method when the error is conditionally heteroskedastic. Moreover, the series estimators have well-defined meanings as estimating the best approximation function for the unknown conditional mean regression function even when the model is misspecified. The price of using the general series estimation methods is that it is difficult to establish the asymptotic normality result for the nonparametric components under optimal smoothing (i.e., when the squared bias and variance terms are balanced). Thus, the series method should be viewed as a complement to the kernel method in estimating a partially linear varying coefficient model.

2. Estimation. Consider the following partially linear varying coefficient model:

(5) $Y_i = w_i'\gamma + x_i'\beta(z_i) + u_i, \qquad i = 1, \ldots, n,$

where $w_i$ is a $q \times 1$ vector of random variables, $\gamma$ is a $q \times 1$ vector of unknown parameters, $x_i$ is of dimension $d \times 1$, $z_i = (z_{i1}, \ldots, z_{ir})$ is of dimension $r$, $\beta(\cdot) = (\beta_1(\cdot), \ldots, \beta_d(\cdot))'$ is a $d \times 1$ vector of unknown varying coefficient functions, and $u_i$ is an error term satisfying $E(u_i \mid w_i, x_i, z_i) = 0$.

With the series estimation method, for $l = 1, \ldots, d$, we approximate the varying coefficient function $\beta_l(z)$ by $p^{k_l}(z)'\alpha^{k_l}$, a linear combination of $k_l$ base functions, where $p^{k_l}(z) = [p_{l1}(z), \ldots, p_{lk_l}(z)]'$ is a $k_l \times 1$ vector of base functions and $\alpha^{k_l} = (\alpha_{l1}, \ldots, \alpha_{lk_l})'$ is a $k_l \times 1$ vector of unknown parameters. The approximation functions $p^{k_l}(z)$ have the property that, as $k_l$ grows, there is a linear combination of $p^{k_l}(z)$ that can approximate any smooth function $\beta_l(z)$ arbitrarily well in the sense that the approximation mean square error can be made arbitrarily small.

Define the $K \times 1$ vectors $p^K(x_i, z_i) = (x_{i1}p^{k_1}(z_i)', \ldots, x_{id}p^{k_d}(z_i)')'$ and $\alpha = (\alpha^{k_1\prime}, \ldots, \alpha^{k_d\prime})'$, where $K = \sum_{l=1}^{d} k_l$. Thus, we use a linear combination of $K$ functions, $p^K(x_i, z_i)'\alpha$, to approximate $x_i'\beta(z_i)$. Hence, we can rewrite (5) as

(6) $Y_i = w_i'\gamma + p^K(x_i, z_i)'\alpha + (x_i'\beta(z_i) - p^K(x_i, z_i)'\alpha) + u_i = w_i'\gamma + p^K(x_i, z_i)'\alpha + \mathrm{error}_i,$

where the definition of $\mathrm{error}_i$ should be apparent.

We introduce some matrix notation. Let $Y = (Y_1, \ldots, Y_n)'$, $u = (u_1, \ldots, u_n)'$, $W = (w_1, \ldots, w_n)'$, $G = (x_1'\beta(z_1), \ldots, x_n'\beta(z_n))'$ and $P = (p^K(x_1, z_1), \ldots, p^K(x_n, z_n))'$. Hence, model (6) can be written in matrix notation as

(7) $Y = W\gamma + P\alpha + \mathrm{error}.$

Let $\hat\gamma$ and $\hat\alpha$ denote the least squares estimators of $\gamma$ and $\alpha$ obtained by regressing $Y$ on $(W, P)$ from (7). Then we estimate $\beta_l(z)$ by $\hat\beta_l(z) = p^{k_l}(z)'\hat\alpha^{k_l}$ ($l = 1, \ldots, d$). We will establish the $\sqrt{n}$-normality result for $\hat\gamma$ and derive the rate of convergence for $\hat\beta_l(z)$.

We present an alternative form for $\hat\gamma$ and $\hat\alpha$ that is convenient for the asymptotic analysis given below. In matrix form, (5) can be written as

(8) $Y = W\gamma + G + u.$

Define $M = P(P'P)^{-}P'$, where $(\cdot)^{-}$ denotes any symmetric generalized inverse of $(\cdot)$. [Under the assumptions given in this paper, $P'P$ is nonsingular with probability one. In finite sample applications, if $P'P$ is singular, one can remove the redundant regressors to make $P'P$ nonsingular.] For an $n \times m$ matrix $A$, define $\tilde A = MA$. Then premultiplying (8) by $M$ leads to

(9) $\tilde Y = \tilde W\gamma + \tilde G + \tilde u.$

Subtracting (9) from (8) yields

(10) $Y - \tilde Y = (W - \tilde W)\gamma + (G - \tilde G) + u - \tilde u.$

The estimator $\hat\gamma$ can also be obtained as the least squares regression of $Y - \tilde Y$ on $W - \tilde W$, that is,

(11) $\hat\gamma = [(W - \tilde W)'(W - \tilde W)]^{-}(W - \tilde W)'(Y - \tilde Y).$

And $\hat\alpha$ can be obtained from (7) with $\gamma$ being replaced by $\hat\gamma$,

(12) $\hat\alpha = (P'P)^{-}P'(Y - W\hat\gamma),$

from which we obtain $\hat\beta_l(z) = p^{k_l}(z)'\hat\alpha^{k_l}$, $l = 1, \ldots, d$.

Under the assumptions given below, both $(W - \tilde W)'(W - \tilde W)$ and $P'P$ are asymptotically nonsingular. Hence, $\hat\gamma$ and $\hat\alpha$ given in (11) and (12) are well defined and they are numerically identical to the least squares estimators obtained by regressing $Y$ on $(W, P)$.
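The numerical equivalence between the partialled-out form (11)-(12) and the joint least squares regression of $Y$ on $(W, P)$ can be checked on simulated data. The following is a minimal sketch rather than the authors' implementation: it assumes a scalar $z$, a power-series basis with the same number of terms `k` for every coefficient function, and a made-up data generating process; the helper names `power_basis`, `series_design` and `plvc_fit` are hypothetical.

```python
import numpy as np

def power_basis(z, k):
    """n x k matrix [1, z, ..., z^(k-1)]: a power-series basis p^k(z)."""
    return np.vander(z, N=k, increasing=True)

def series_design(x, z, k):
    """Build P, whose i-th row is p^K(x_i, z_i)' = (x_i1 p^k(z_i)', ..., x_id p^k(z_i)')."""
    basis = power_basis(z, k)
    return np.hstack([x[:, [l]] * basis for l in range(x.shape[1])])

def plvc_fit(y, W, x, z, k):
    """Estimate gamma as in (11) and alpha as in (12)."""
    P = series_design(x, z, k)

    def project(a):
        # M a = P (P'P)^- P' a, computed by least squares without forming M.
        return P @ np.linalg.lstsq(P, a, rcond=None)[0]

    W_res = W - project(W)          # W - W~
    y_res = y - project(y)          # Y - Y~
    gamma_hat = np.linalg.lstsq(W_res, y_res, rcond=None)[0]
    alpha_hat = np.linalg.lstsq(P, y - W @ gamma_hat, rcond=None)[0]
    return gamma_hat, alpha_hat, P

# Illustrative data (an assumption, not one of the DGPs in this paper).
rng = np.random.default_rng(0)
n = 500
z = rng.uniform(0.0, 1.0, n)
x = np.column_stack([np.ones(n), rng.uniform(0.0, 2.0, n)])   # x_i = (1, v_i)'
W = (x[:, 1] + rng.uniform(0.0, 2.0, n)).reshape(-1, 1)       # q = 1
beta = np.column_stack([np.sin(2.0 * z), 1.0 + z ** 2])       # beta(z_i)
y = 0.5 * W[:, 0] + np.sum(x * beta, axis=1) + rng.normal(0.0, 0.5, n)

gamma_hat, alpha_hat, P = plvc_fit(y, W, x, z, k=4)

# Joint least squares of Y on (W, P), as in (7).
joint = np.linalg.lstsq(np.hstack([W, P]), y, rcond=None)[0]
print("partialled-out gamma:", gamma_hat, " joint gamma:", joint[:1])
```

The projection is applied through `numpy.linalg.lstsq` so that the $n \times n$ matrix $M$ is never formed; this is a convenience of the sketch, not something the text prescribes.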

Next we give a definition and some assumptions that are used to derive 
the main results of this paper. 

Definition 2.1. $g(x, z)$ is said to belong to the varying coefficient class of functions $\mathcal{G}$ if:

(i) $g(x, z) = x'h(z) = \sum_{l=1}^{d} x_l h_l(z)$ for some continuous functions $h_l(z)$, where $h(z) = (h_1(z), \ldots, h_d(z))'$.

(ii) $\sum_{l=1}^{d} E[x_{il}^2 h_l(z_i)^2] < \infty$, where $x_l$ ($x_{il}$) is the $l$th component of $x$ ($x_i$).

For any function $f(x, z)$, let $E_{\mathcal{G}}[f(x, z)]$ denote the projection of $f(x, z)$ onto the varying coefficient functional space $\mathcal{G}$ (under the $L_2$-norm). That is, $E_{\mathcal{G}}[f(x, z)]$ is an element that belongs to $\mathcal{G}$ and it is the closest function to $f(x, z)$ among all the functions in $\mathcal{G}$. More specifically ($x_l$ is the $l$th component of $x$, $l = 1, \ldots, d$),

(13) $E\{(f(x, z) - E_{\mathcal{G}}[f(x, z)])(f(x, z) - E_{\mathcal{G}}[f(x, z)])'\} = \inf_{\sum_{l=1}^{d} x_l h_l(z) \in \mathcal{G}} E\Big\{\Big(f(x, z) - \sum_{l=1}^{d} x_l h_l(z)\Big)\Big(f(x, z) - \sum_{l=1}^{d} x_l h_l(z)\Big)'\Big\}.$

Thus,

(14) $E[(f(x, z) - E_{\mathcal{G}}[f(x, z)])(f(x, z) - E_{\mathcal{G}}[f(x, z)])'] \le E\Big[\Big(f(x, z) - \sum_{l=1}^{d} x_l h_l(z)\Big)\Big(f(x, z) - \sum_{l=1}^{d} x_l h_l(z)\Big)'\Big]$

for all $g(x, z) = \sum_{l=1}^{d} x_l h_l(z) \in \mathcal{G}$. Here, for square matrices $A$ and $B$, $A \le B$ means that $A - B$ is negative semidefinite.

Define $\theta(x, z) = E[w \mid x, z]$ and $m(x, z) = E_{\mathcal{G}}[\theta(x, z)]$. The following assumptions will be used to establish the asymptotic distribution of $\hat\gamma$ and the convergence rates of $\hat\beta(z)$.

Assumption 2.1. (i) $(Y_i, w_i, x_i, z_i)_{i=1}^{n}$ are independent and identically distributed as $(Y_1, w_1, x_1, z_1)$ and the support of $(w_1, x_1, z_1)$ is a compact subset of $\mathbb{R}^{q+d+r}$; (ii) both $\theta(x_1, z_1)$ and $\mathrm{var}[Y_1 \mid w_1, x_1, z_1]$ are bounded functions on the support of $(w_1, x_1, z_1)$.

Assumption 2.2. (i) For every $K$ there is a nonsingular matrix $B$ such that for $P^K(x, z) = Bp^K(x, z)$ the smallest eigenvalue of $E[P^K(x_i, z_i)P^K(x_i, z_i)']$ is bounded away from zero uniformly in $K$; (ii) there is a sequence of constants $\zeta_0(K)$ satisfying $\sup_{(x,z) \in S} \|P^K(x, z)\| \le \zeta_0(K)$ and $K = K_n$ such that $(\zeta_0(K))^2 K/n \to 0$ as $n \to \infty$, where $S$ is the support of $(x_1, z_1)$, and for a matrix $A$, $\|A\| = [\mathrm{tr}(A'A)]^{1/2}$ denotes the Euclidean norm of $A$.

Assumption 2.3. (i) For $f(x, z) = \sum_{l=1}^{d} x_l\beta_l(z)$ or $f(x, z) = m_j(x, z)$ ($j = 1, \ldots, q$), there exist some $\delta_l > 0$ ($l = 1, \ldots, d$), $\alpha_f = \alpha_{fK} = (\alpha^{k_1\prime}, \ldots, \alpha^{k_d\prime})'$, such that $\sup_{(x,z) \in S} |f(x, z) - P^K(x, z)'\alpha_f| = O(\sum_{l=1}^{d} k_l^{-\delta_l})$; (ii) as $\min\{k_1, \ldots, k_d\} \to \infty$, $\sqrt{n}(\sum_{l=1}^{d} k_l^{-\delta_l}) \to 0$ as $n \to \infty$.

Assumption 2.1 is a standard assumption used in series estimation methods. Assumption 2.2 usually implies that the density function of $(x, z)$ needs to be bounded below by a positive constant. Assumption 2.3 says that there exist some $\delta_l > 0$ ($l = 1, \ldots, d$) such that the uniform approximation error to the function shrinks at the rate $\sum_{l=1}^{d} k_l^{-\delta_l}$. Assumptions 2.2 and 2.3 are not the easiest conditions to verify, but it is known that many series functions satisfy these conditions, for example, power series and splines.

Under the above assumptions, we can state our main theorem. 

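Theorem 2.1. Define $\varepsilon_i = w_i - m(x_i, z_i)$, where $m(x_i, z_i) = E_{\mathcal{G}}(w_i)$, and assume that $\Phi = E[\varepsilon_i\varepsilon_i']$ is positive definite. Then under Assumptions 2.1-2.3 we have:

(i) $\sqrt{n}(\hat\gamma - \gamma) \to N(0, \Sigma)$ in distribution, where $\Sigma = \Phi^{-1}\Omega\Phi^{-1}$, $\Omega = E[\sigma^2(w_i, x_i, z_i)\varepsilon_i\varepsilon_i']$ and $\sigma^2(w_i, x_i, z_i) = E[u_i^2 \mid w_i, x_i, z_i]$.

(ii) A consistent estimator of $\Sigma$ is given by $\hat\Sigma = \hat\Phi^{-1}\hat\Omega\hat\Phi^{-1}$, where $\hat\Phi = n^{-1}\sum_{i=1}^{n}(w_i - \tilde w_i)(w_i - \tilde w_i)'$, $\hat\Omega = n^{-1}\sum_{i=1}^{n}\hat u_i^2(w_i - \tilde w_i)(w_i - \tilde w_i)'$, $\tilde w_i$ is the $i$th row of $\tilde W$ and $\hat u_i = Y_i - w_i'\hat\gamma - p^K(x_i, z_i)'\hat\alpha$.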
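The variance estimator in part (ii) of Theorem 2.1 is a sandwich formula and is straightforward to compute once $W$ and $P$ are available. The sketch below is illustrative only: the function name `sandwich_variance`, the power-series basis and the heteroskedastic simulated data are assumptions made for the example, not part of the paper.

```python
import numpy as np

def sandwich_variance(y, W, P):
    """Return gamma_hat and Sigma_hat = Phi_hat^{-1} Omega_hat Phi_hat^{-1}."""
    n = y.shape[0]
    X = np.hstack([W, P])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    u_hat = y - X @ coef                                   # u_hat_i
    # Rows of W_res are (w_i - w~_i)', where W~ = M W.
    W_res = W - P @ np.linalg.lstsq(P, W, rcond=None)[0]
    Phi_hat = W_res.T @ W_res / n
    Omega_hat = (W_res * (u_hat ** 2)[:, None]).T @ W_res / n
    Phi_inv = np.linalg.inv(Phi_hat)
    return coef[: W.shape[1]], Phi_inv @ Omega_hat @ Phi_inv

# Illustrative heteroskedastic data (an assumption for this example).
rng = np.random.default_rng(1)
n = 400
z = rng.uniform(0.0, 1.0, n)
x = rng.uniform(0.0, 2.0, n)
W = (x + rng.uniform(0.0, 2.0, n)).reshape(-1, 1)
P = x[:, None] * np.vander(z, N=4, increasing=True)        # x_i * p^k(z_i)'
u = rng.normal(0.0, 1.0, n) * (0.3 + 0.3 * W[:, 0])        # variance depends on w_i
y = 0.5 * W[:, 0] + x * np.cos(z) + u

gamma_hat, Sigma_hat = sandwich_variance(y, W, P)
std_err = np.sqrt(np.diag(Sigma_hat) / n)                  # standard error of gamma_hat
print(gamma_hat, std_err)
```

Dividing the diagonal of $\hat\Sigma$ by $n$ before taking square roots converts the asymptotic variance of $\sqrt{n}(\hat\gamma - \gamma)$ into a standard error for $\hat\gamma$ itself.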

The proof of Theorem 2.1 is given in the Appendix. [One may prove Theorem 2.1 based on the general results of Shen (1997) and Ai and Chen (2003), which require one to establish stochastic equicontinuity of the objective function. However, for the specific partially linear varying semiparametric model, it is easier to use a direct proof as given in the Appendix.]

Under the conditional homoskedastic error assumption $E[u_i^2 \mid w_i, x_i, z_i] = E(u_i^2) = \sigma^2$, the estimator $\hat\gamma$ is semiparametrically efficient in the sense that the inverse of the asymptotic variance of $\sqrt{n}(\hat\gamma - \gamma)$ equals the semiparametric efficiency bound. From the result of Chamberlain (1992) [the concept of semiparametric efficiency bound we use here is discussed in Chamberlain (1992), which gives the lower bound for the asymptotic variance of a (regular) estimator satisfying some conditional moment conditions; see also Bickel, Klaassen, Ritov and Wellner (1993) for a more general treatment of efficient and adaptive inference in semiparametric models], the semiparametric efficiency bound for the inverse of the asymptotic variance of an estimator of $\gamma$ is

(15) $J_0 = \inf_{g \in \mathcal{G}} E[(w_i - g(x_i, z_i))(\mathrm{var}[u_i \mid w_i, x_i, z_i])^{-1}(w_i - g(x_i, z_i))'].$

Under the conditional homoskedastic error assumption $\mathrm{var}[u_i \mid w_i, x_i, z_i] = \sigma^2$, (15) can be rewritten as [with $m(x_i, z_i) = E_{\mathcal{G}}(w_i)$]

(16) $J_0 = \frac{1}{\sigma^2}\inf_{g \in \mathcal{G}} E[(w_i - g(x_i, z_i))(w_i - g(x_i, z_i))'] = \frac{1}{\sigma^2}E[(w_i - m(x_i, z_i))(w_i - m(x_i, z_i))'] = \frac{1}{\sigma^2}\Phi.$

Note that the inverse of (16) coincides with $\Sigma = \sigma^2\Phi^{-1}$, the asymptotic variance of $\sqrt{n}(\hat\gamma - \gamma)$ when the error is conditionally homoskedastic. Hence, $\Sigma^{-1} = J_0$ and $\hat\gamma$ is a semiparametrically efficient estimator under the conditional homoskedastic error assumption.

The next theorem gives the convergence rate of $\hat\beta_l(z) = p^{k_l}(z)'\hat\alpha^{k_l}$ to $\beta_l(z)$ for $l = 1, \ldots, d$.

Theorem 2.2. Under Assumptions 2.1-2.3, let $S_z$ denote the support of $z_i$. Then we have, for $l = 1, \ldots, d$:

(i) $\sup_{z \in S_z} |\hat\beta_l(z) - \beta_l(z)| = O_p(\zeta_0(K)(\sqrt{K}/\sqrt{n} + \sum_{l=1}^{d} k_l^{-\delta_l}))$.

(ii) $\frac{1}{n}\sum_{i=1}^{n}(\hat\beta_l(z_i) - \beta_l(z_i))^2 = O_p(K/n + \sum_{l=1}^{d} k_l^{-2\delta_l})$.

(iii) $\int(\hat\beta_l(z) - \beta_l(z))^2\,dF_z(z) = O_p(K/n + \sum_{l=1}^{d} k_l^{-2\delta_l})$, where $F_z$ is the cumulative distribution function of $z_i$.

The proof of Theorem 2.2 is given in the Appendix.

Newey (1997) gives some primitive conditions for power series and B-splines such that Assumptions 2.1-2.3 hold. We state them here for the readers' convenience.

Assumption 2.4. (i) The support of $(x_i, z_i)$ is a Cartesian product of compact connected intervals on which $(x_i, z_i)$ has an absolutely continuous probability density function that is bounded above by a positive constant and bounded away from zero; (ii) for $l = 1, \ldots, d$, $f_l(x, z)$ is continuously differentiable of order $c_l$ on the support $S$, where $f_l(x, z) = x_l\beta_l(z)$ or $f_l(x, z) = m_l(x, z)$.

Assumption 2.5. The support of $(x_i, z_i)$ is $[-1, 1]^{d+r}$.

Suppose that a smooth function $\eta(z)$ ($z \in \mathbb{R}^r$) is continuously differentiable of order $c$. It is well established that the approximation error by using power series or B-splines is of the order of $O(K^{-c/r})$; see Lorentz (1966), Andrews (1991), Newey (1997) and Huang (1998). Therefore, Assumption 2.3(i) holds for power series and B-splines if $\sqrt{n}(\sum_{l=1}^{d} k_l^{-c_l/r}) = o(1)$ (i.e., $\delta_l = c_l/r$). Newey (1997) shows that, for power series or splines, Assumption 2.4 implies that the smallest eigenvalue of $E[P^K(x_i, z_i)P^K(x_i, z_i)']$ is bounded away from zero for all $K$. Also, Assumptions 2.4 and 2.5 imply that Assumptions 2.2 and 2.3 hold for B-splines with $\zeta_0(K) = O(\sqrt{K})$. Hence, we have the following results for regression splines.

Theorem 2.3. For splines, if Assumptions 2.1, 2.4 and 2.5 are satisfied, and $k_l^2/n \to 0$ as $n \to \infty$ for $l = 1, \ldots, d$, then:

(i) The conclusion of Theorem 2.1 holds.

(ii) The conclusion of Theorem 2.2 holds with $\sqrt{K}$ replacing $\zeta_0(K)$.

Theorem 2.2 only gives the rate of convergence of the series estimator for the varying coefficient function $\beta(z)$. As we mentioned in the Introduction, it is difficult to obtain asymptotic normality results for the series estimator of $\beta(z)$ under optimal smoothing. The reason is that the asymptotic bias of the series estimator is unknown in general. Recently, Zhou, Shen and Wolfe (1998) have obtained an asymptotic bias for univariate spline regression functions that belong to $C^p$ (i.e., the regression functions have continuous $p$th derivatives) under somewhat stringent conditions, such as that the knots are asymptotically equally spaced and the degree of the spline $m$ is equal to $p - 1$. See Huang (2003) for a more detailed discussion on the difficulty of obtaining the asymptotic bias for general cases with splines. Alternatively, one may choose to undersmooth the data. In this case the bias is asymptotically negligible. Huang (2003) has obtained the asymptotic distribution of spline estimators under quite general conditions (provided the data are slightly undersmoothed). Huang, Wu and Zhou (2002, 2004) have further provided asymptotic distribution results for spline estimation of a varying coefficient model. Their results can be directly applied to obtain the asymptotic distribution of $\hat\beta(z)$ in a partially linear varying coefficient model. This is because $\hat\gamma - \gamma = O_p(n^{-1/2})$, which converges to zero faster than any nonparametric estimation convergence rate. Therefore, $\hat\beta(z)$ has the same asymptotic distribution whether one uses the estimator $\hat\gamma$ or the true $\gamma$; in the latter case the model becomes a varying coefficient model (with $\gamma$ known) and the results of Huang, Wu and Zhou (2002, 2004) apply.

3. Monte Carlo simulations. In this section we report some simulation results to examine the finite sample performance of our proposed estimator, and also compare it with the kernel-based profile likelihood estimator suggested by Fan and Huang (2002). We first consider the following data generating process (DGP):

(17) DGP1: $y_i = 1 + 0.5w_i + x_i\beta_1(z_i) + u_i, \qquad i = 1, \ldots, n,$

where

(18) $\beta_1(z_i) = 1 + (24z_i)^3\exp(-24z_i)$

is taken from Hart (1997), $\beta_0 = 1$ and $\gamma = 0.5$. The errors $u_i$ are i.i.d. normal with mean 0 and variance 0.25, $z_i$ is generated by the i.i.d. uniform[0, 2] distribution, $w_i = v_{1i} + 2v_{3i}$ and $x_i = v_{2i} + v_{3i}$, where $v_{ji}$, $j = 1, 2, 3$, are i.i.d. uniform[0, 2].
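For readers who wish to reproduce the design, DGP1 in (17)-(18) can be simulated in a few lines. This is a sketch under the stated design; the function names `beta1` and `simulate_dgp1` and the choice of random number generator are assumptions of the example.

```python
import numpy as np

def beta1(z):
    """Varying coefficient function in (18)."""
    return 1.0 + (24.0 * z) ** 3 * np.exp(-24.0 * z)

def simulate_dgp1(n, rng):
    """Draw one sample (y, w, x, z) of size n from DGP1 in (17)."""
    v1, v2, v3 = rng.uniform(0.0, 2.0, size=(3, n))   # v_ji i.i.d. uniform[0, 2]
    z = rng.uniform(0.0, 2.0, size=n)
    w = v1 + 2.0 * v3
    x = v2 + v3
    u = rng.normal(0.0, 0.5, size=n)                  # standard deviation 0.5, variance 0.25
    y = 1.0 + 0.5 * w + x * beta1(z) + u
    return y, w, x, z

rng = np.random.default_rng(0)
y, w, x, z = simulate_dgp1(20000, rng)
resid = y - 1.0 - 0.5 * w - x * beta1(z)              # recovers the simulated errors
print(resid.mean(), resid.var())
```

Because $w_i$ and $x_i$ share the component $v_{3i}$, the parametric and varying coefficient regressors are correlated by construction.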

We also consider a second data generating process:

(19) DGP2: $y_i = 1 + 0.5w_i + x_{i1}\beta_1(z_i) + x_{i2}\beta_2(z_i) + u_i, \qquad i = 1, \ldots, n,$

where $\beta_1(z_i)$ is the same as in DGP1, $\beta_2(z_i) = z_i + \sin(z_i)$, $z_i$ is i.i.d. uniform[0, 2], $u_i$ is i.i.d. normal with mean 0 and variance 0.25, $w_i = v_{1i} + 2v_{3i}$, $x_{i1} = v_{2i} + v_{3i}$, and $x_{i2} = v_{4i} + 0.5v_{3i}$, where $v_{ji}$ ($j = 1, 2, 3, 4$) are i.i.d. uniform[0, 2].

The sample sizes are $n = 100$ and $n = 200$, and the number of replications is 5000 for all cases. We compare the estimated mean squared error (MSE) of $\hat\gamma$ defined by $\mathrm{MSE}(\hat\gamma) = \frac{1}{5000}\sum_{j=1}^{5000}(\hat\gamma_j - \gamma)^2$, and the estimated mean average squared error (MASE) of $\hat\beta_l(\cdot)$ defined by $\mathrm{MASE}(\hat\beta_l(\cdot)) = \frac{1}{5000}\sum_{j=1}^{5000}\big[\frac{1}{n}\sum_{i=1}^{n}(\hat\beta_{l,j}(z_i) - \beta_l(z_i))^2\big]$ ($l = 1$ for DGP1, $l = 1, 2$ for DGP2), where $\hat\gamma_j$ and $\hat\beta_{l,j}(z_i)$ are, respectively, the estimates of $\gamma$ and $\beta_l(z_i)$ from the $j$th replication based on one of the two methods: the B-spline method and the kernel-based profile likelihood method. We use a univariate cubic B-spline basis function defined by

(20) $B(z \mid t_0, \ldots, t_4) = \frac{1}{3!}\sum_{j=0}^{4}(-1)^j\binom{4}{j}[\max(0, z - t_j)]^3,$

where $t_0, \ldots, t_4$ are the evenly-spaced design knots. The kernel estimator of $\gamma$ is discussed at the end of Section 2. The number of terms $K$ in series estimation and the smoothing parameter $h$ in kernel estimation are both selected by leave-one-out least squares cross-validation. As discussed in Bickel and Kwon (2002), the estimation of the parametric component is not very sensitive to the choice of smoothing parameters, as long as the selected smoothing parameters do not create excessive bias in the estimation of the nonparametric components. In this regard, the cross-validation method usually performs well. (Other data driven methods in selecting $K$ in series estimation include the following: the generalized cross-validation criterion [Craven and Wahba (1979) and Li (1987)] and Mallows' $C_p$ criterion [Mallows (1973)].)

Table 1
$\mathrm{MSE}(\hat\gamma)$ by spline and kernel methods

                                                   DGP1                      DGP2
                                            n = 100    n = 200       n = 100    n = 200
Cubic B-spline       MSE($\hat\gamma$)      0.00278    0.00133       0.00357    0.00153
Profile likelihood   MSE($\hat\gamma$)      0.00315    0.00145       0.00443    0.00178
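The cubic B-spline basis function in (20) is a signed sum of truncated cubic powers and can be evaluated directly. The sketch below assumes the normalizing constant $1/3!$ and five knots $t_0, \ldots, t_4$; the function name `cubic_bspline` is hypothetical. With unit knot spacing the result is the standard cardinal cubic B-spline, which is what the checks at the end rely on.

```python
import numpy as np
from math import comb, factorial

def cubic_bspline(z, knots):
    """Evaluate B(z | t_0, ..., t_4) as in (20) for five evenly spaced knots."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    total = np.zeros_like(z)
    for j, t in enumerate(knots):
        # (-1)^j * C(4, j) * max(0, z - t_j)^3
        total += (-1) ** j * comb(4, j) * np.maximum(0.0, z - t) ** 3
    return total / factorial(3)

knots = [0.0, 1.0, 2.0, 3.0, 4.0]            # unit spacing, chosen for illustration
grid = np.linspace(-1.0, 5.0, 121)
values = cubic_bspline(grid, knots)

peak = cubic_bspline(2.0, knots)[0]          # value at the middle knot
outside = cubic_bspline([-1.0, 5.0], knots)  # outside the support [t_0, t_4]
print(peak, outside)
```

A full basis for $\beta_l(z)$ would be obtained by shifting this function along a grid of knots; how many shifted copies to use is the choice of $K$ discussed in the text.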

The simulation results are presented in Table 1. From Table 1, first we observe that as the sample size doubles, the estimated MSE for both estimators reduces to about half of its original value; this is consistent with the fact that both are $\sqrt{n}$-consistent estimators of $\gamma$. Second, we observe that the B-spline method gives slightly smaller estimated MSE of $\hat\gamma$ for both DGPs. Under the conditional homoskedastic error condition, both methods are semiparametrically efficient. Therefore, they have the same asymptotic efficiency. The results in Table 1 may reflect small sample differences of the two estimation methods for the chosen data generating processes (DGPs). It is possible that for some other DGPs the kernel method may have better small sample performance. In fact, a few simulated examples cannot differentiate the finite sample performance of the two methods.

Table 2 reports $\mathrm{MASE}(\hat\beta(z))$ for the spline and the profile likelihood methods. The spline and the kernel methods give similar estimation results for $\mathrm{MASE}(\hat\beta(z))$ for both DGPs.

The results of Tables 1 and 2 are based on the least-squares cross-validation selection of $K$ (for spline) and $h$ (for the profile likelihood method). To examine whether our findings only reflect a particular way of selecting the smoothing parameters (the cross-validation method), we also compute the $\mathrm{MSE}(\hat\gamma)$ and $\mathrm{MASE}(\hat\beta(\cdot))$ for a range of different values of $K$ and $h$ without leave-one-out in the estimation. Figures 1 and 2 plot the estimation results.

In Figure 1(a) the dashed line plots the leave-one-out cross-validation function for a range of $K$ for the spline method (DGP1, $n = 100$, average over the 5,000 replications). We observe that the cross-validation function is minimized around $K = 10$. The solid line in Figure 1(a) is the sum of squared residuals computed without using the leave-one-out estimator; as expected, it decreases as $K$ increases.

Fig. 1. (a) CV function and SSR (spline, DGP1, $n = 100$). (b) $\mathrm{MSE}(\hat\gamma)$ (spline, DGP1, $n = 100$). (c) $\mathrm{MASE}(\hat\beta(z))$ (spline, DGP1, $n = 100$).

Fig. 2. (a) CV function and SSR (kernel, DGP1, $n = 100$). (b) $\mathrm{MSE}(\hat\gamma)$ (kernel, DGP1, $n = 100$). (c) $\mathrm{MASE}(\hat\beta(z))$ (kernel, DGP1, $n = 100$).

Figure 1(b) graphs the $\mathrm{MSE}(\hat\gamma)$ computed using all observations (not using the leave-one-out method). We see that $\mathrm{MSE}(\hat\gamma)$ takes minimum values around $K = 10$ and $K = 11$. Figure 1(c) plots the $\mathrm{MASE}(\hat\beta(z))$, again computed using all observations. $\mathrm{MASE}(\hat\beta(z))$ assumes minimum values around $K = 10$. The average of the 5,000 cross-validation selected $K$'s is 10.42.

From Figure 1 we can see that, on average, the least squares cross-validation method performs well in selecting a $K$ that is close to the values of $K$ that minimize $\mathrm{MSE}(\hat\gamma)$ and $\mathrm{MASE}(\hat\beta(z))$. Note that neither Figure 1(b) nor Figure 1(c) uses the leave-one-out estimator. Nevertheless, unlike the sum of squared residuals, $\mathrm{MSE}(\hat\gamma)$ and $\mathrm{MASE}(\hat\beta(z))$ do not monotonically decrease as $K$ increases.
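The contrast between the leave-one-out cross-validation function and the in-sample sum of squared residuals can be reproduced for any linear-in-parameters series regression. The sketch below uses the leverage identity for least squares, under which the $i$th leave-one-out residual equals $\hat u_i/(1 - h_{ii})$; the power-series basis, the simulated data and the function names `loo_cv_score` and `loo_cv_brute` are assumptions of the example, not the design used for Figure 1.

```python
import numpy as np

def loo_cv_score(y, X):
    """Leave-one-out CV score via leverages h_ii (requires full column rank X)."""
    Q, _ = np.linalg.qr(X)                 # reduced QR; columns of Q span col(X)
    leverage = np.sum(Q ** 2, axis=1)      # h_ii, the diagonal of the hat matrix
    resid = y - Q @ (Q.T @ y)
    return np.mean((resid / (1.0 - leverage)) ** 2)

def loo_cv_brute(y, X):
    """Same score by refitting n times; used only to check the shortcut."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        errs[i] = y[i] - X[i] @ coef
    return np.mean(errs ** 2)

def ssr(y, X):
    """In-sample sum of squared residuals."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ coef
    return float(r @ r)

rng = np.random.default_rng(2)
n = 80
z = rng.uniform(0.0, 2.0, n)
y = np.sin(3.0 * z) + rng.normal(0.0, 0.3, n)
s = z - 1.0                                # centre z to keep the basis well conditioned

K_grid = list(range(2, 9))
designs = {K: np.vander(s, N=K, increasing=True) for K in K_grid}
cv_values = [loo_cv_score(y, designs[K]) for K in K_grid]
ssr_values = [ssr(y, designs[K]) for K in K_grid]
K_cv = K_grid[int(np.argmin(cv_values))]   # cross-validation choice of K
print(K_cv, cv_values, ssr_values)
```

The in-sample sum of squared residuals can only fall as nested terms are added, whereas the cross-validation score penalizes overfitting and therefore has an interior minimum in typical cases.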

Figure 2 gives the corresponding cases for the profile kernel method. Figure 2(a) shows that the cross-validation function is minimized around $h = 0.04$, while the sum of squares of residuals monotonically increases with $h$.

Figures 2(b) and 2(c) show that both $\mathrm{MSE}(\hat\gamma)$ and $\mathrm{MASE}(\hat\beta(z))$ are minimized around $h = 0.04$. Note that Figures 2(b) and 2(c) are computed using all observations (without using the leave-one-out method). As in the spline case, $\mathrm{MSE}(\hat\gamma)$ and $\mathrm{MASE}(\hat\beta(z))$ do not change monotonically with $h$, but rather they are both minimized around the value of $h$ that minimizes the cross-validation function.

Summarizing the results of Figures 1 and 2, we find that the cross-validation method performs adequately for the simulated data. The simulation results reported in this section show that both the spline and the kernel methods can be useful tools in estimating a partially linear varying coefficient model.

Table 2
$\mathrm{MASE}(\hat\beta(\cdot))$ by spline and kernel methods

                              DGP1                         DGP2
                      MASE($\hat\beta_1(\cdot)$)   MASE($\hat\beta_1(\cdot)$)   MASE($\hat\beta_2(\cdot)$)
                      n = 100    n = 200           n = 100    n = 200           n = 100    n = 200
Cubic B-spline        0.0162     0.00764           0.0576     0.0245            0.0635     0.0326
Profile likelihood    0.0224     0.0110            0.0815     0.0356            0.0593     0.0318

4. An empirical application. In this section we consider estimation of a production function in China's manufacturing industry to illustrate the application of the partially linear varying coefficient model. The data used
in this paper are drawn from the Third Industrial Census of China con- 
ducted by the National Statistical Bureau of China in 1995. The Third 
Industrial Census of China is currently the most comprehensive industrial 
survey in China. To avoid heterogeneity across different industries and also 
to maintain enough observations in the sample for accurate semiparametric 
estimation, we include firms from the sector of food, soft drink and cigarette 
manufacturing in this study. After removing firms with missing values, the 
sample size we use is 877. We estimate a benchmark parametric linear model 
as follows: 

(21) $\ln Y = \beta_0 + \gamma\ln w + \beta_l\ln L + \beta_k\ln K + \beta_z\ln z + u,$

where Y is the sales of the firm, w is the liquid capital, L is the labor input, 
K is the fixed capital and z is the firm's R&D (all monetary measures are 
in thousand RMB, the Chinese currency). 

The partially linear varying coefficient model is given by 

(22) $\ln Y = \gamma\ln w + \beta_0(z) + \beta_l(z)\ln L + \beta_k(z)\ln K + u.$

Here we choose liquid capital as the w variable whose coefficient does not 
depend on the firm's R&D spending (z). We have given some theoretical 
arguments for this model specification in the Introduction; to justify this 
choice statistically, we test both models (21) and (22) against a more general 
semiparametric varying coefficient model, 

(23) $\ln Y = \gamma(z)\ln w + \beta_0(z) + \beta_l(z)\ln L + \beta_k(z)\ln K + u.$

Obviously, (23) includes (22) as a special case when $\gamma(z)$ is constant for all $z$. We use quadratic and cubic splines and the number of knots is chosen by the least squares cross-validation method. The cross-validation method selected the quadratic spline. Our test for the null models (21) and (22) is based on $(RSS_0 - RSS)/RSS$, where $RSS_0$ is the residual sum of squares from the null model, and $RSS$ is from the alternative model (23). We obtain the critical values of our test based on 1,000 residual-based bootstrap replications: we first obtain the residuals from the null model, from which we generate two-point wild bootstrap errors, which in turn are used to generate bootstrap $\ln Y$'s (using the estimated null model); the bootstrap statistic is $(RSS_0^* - RSS^*)/RSS^*$, where $RSS_0^*$ is the residual sum of squares from the null model computed using the bootstrap sample and $RSS^*$ is computed from the alternative model also using the bootstrap sample. Note that the bootstrap sample is generated according to the null model. Therefore, the bootstrap statistic approximates the null distribution of the original test statistic even when the null hypothesis is false. When testing the parametric null model, we firmly reject the null model with a $p$-value of 0.001. For testing the partially linear varying coefficient model (22), we cannot reject this null model at conventional levels (a $p$-value of 0.162). Therefore, both economic theory and the statistical testing results support our specification (22).
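The residual-based wild bootstrap test described above can be sketched for any pair of nested linear-in-parameters models. The code below is an illustration under stated assumptions: the text only says "two point" wild bootstrap errors, so the specific two-point law used here (Mammen's distribution, with mean 0 and variance 1) is an assumption, as are the function names, the default of 199 replications and the toy data.

```python
import numpy as np

def rss_and_coef(y, X):
    """Residual sum of squares and least squares coefficients of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ coef
    return float(r @ r), coef

def wild_bootstrap_pvalue(y, X0, X1, n_boot=199, seed=0):
    """Bootstrap p-value for (RSS_0 - RSS)/RSS, null design X0 nested in X1."""
    rng = np.random.default_rng(seed)
    rss0, coef0 = rss_and_coef(y, X0)
    rss1, _ = rss_and_coef(y, X1)
    stat = (rss0 - rss1) / rss1
    fitted0 = X0 @ coef0
    resid0 = y - fitted0                     # residuals from the null model
    # Two-point law (assumed): a with probability p_a, b otherwise.
    a, b = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    p_a = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    exceed = 0
    for _ in range(n_boot):
        v = np.where(rng.random(len(y)) < p_a, a, b)
        y_star = fitted0 + resid0 * v        # generated under the estimated null model
        r0, _ = rss_and_coef(y_star, X0)
        r1, _ = rss_and_coef(y_star, X1)
        exceed += int((r0 - r1) / r1 >= stat)
    return stat, (1 + exceed) / (1 + n_boot)

# Toy example: linear null against an alternative with a quadratic term.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0.0, 2.0, n)
X0 = np.column_stack([np.ones(n), x])
X1 = np.column_stack([X0, x ** 2])
y_alt = 1.0 + x + 2.0 * x ** 2 + rng.normal(0.0, 0.2, n)   # null is false
y_null = 1.0 + x + rng.normal(0.0, 0.2, n)                 # null is true
stat_alt, p_alt = wild_bootstrap_pvalue(y_alt, X0, X1)
stat_null, p_null = wild_bootstrap_pvalue(y_null, X0, X1)
print(p_alt, p_null)
```

In the application the designs `X0` and `X1` would be built from the spline terms of models (21) or (22) and of model (23), respectively; the toy designs here only illustrate the mechanics.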

The estimated value of 7 based on (22) is 0.481, with a standard error 
of 0.0372 (the t-statistic is 12.91). The goodness-of-fit R 2 is 0.566 [R 2 = 
1 — RSS I J2i{Ui ~ V) 2 ■, Hi = lnli]. The estimated varying coefficient functions 
are plotted in Figures 3(a) to 3(c). Po(z) is plotted in Figure 3(a). Figure 3(b) 
shows that the marginal productivity of labor Pi(z) is a nonlinear function of 
z (R&D). The marginal productivity of labor first increases with z and then 
decreases as z increases further. The bell shape of the curve suggests that, 
while modest R&D can improve labor productivity, higher R&D leads to 
lower labor productivity. Figure 3(c) shows that the marginal productivity 
of (fixed) capital is also nonlinear in z. It exhibits a general up trend with z, 
indicating that firms with large R&D spending yield relative higher marginal 
(fixed) capital productivity. These results are not surprising given that most 
of the firms in our sample are state-owned. It is typical in these firms that 
capital is scarce while labor is excessive. Thus, most of the R&D expenses are 
used to improve equipment performance, but not to train labor. In Figure 
3(d) we graph the return to scale function 7 + (3i{z) + f3k( z )- The return to 
scale is well below one (the constant return to scale level) for firms with 
small R&D, and it increases to a range between 0.8 to 0.9 for firms with 
large R&D expenditures. The results indicate that most of the firms in our 
sample exhibit decreasing returns to scale in production. It partly reflects the 
fact that the firms included in the survey are large firms, most of which are 
state-owned firms. These firms typically have a production scale larger than 
ideal. In particular, there are usually too many employees in these firms. It 
was not until several years after the survey we use in this paper, as a result 
of fierce competition from foreign firms and the passage of bankruptcy law 
in China, that the food, soft drink and cigarette sector witnessed a string of 
reorganizations, mergers and acquisitions. Further discussion is beyond the 
scope of this paper. 

We have also applied the kernel profile likelihood method to this data 
set. The estimation results are quite similar to those obtained by the spline 
method. For example, the estimated $\gamma$ is 0.489 with a t-statistic of 13.20.
The $\beta(z)$ functions all have shapes similar to those obtained by the spline
method. Therefore, we do not report the kernel estimation results here.

5. Possible extension. In this section we briefly discuss (without providing
technical details) efficient estimation of a partially linear varying coefficient
model when the error is conditionally heteroskedastic.

Fig. 3. (a) $\beta_0(z)$ (spline), (b) $\beta_l(z)$ (spline), (c) $\beta_k(z)$ (spline), (d) return to scale (spline).

Theorem 2.1 holds even when the error is conditionally heteroskedastic,
say, $E(u_i^2 \mid v_i) = \sigma^2(v_i)$, where $v_i = (w_i, x_i, z_i)$. However, in this case $\hat\gamma$ is not
semiparametrically efficient. An efficient estimator can be obtained by dividing
each term in (5) by $\sigma_i = \sqrt{\sigma^2(v_i)}$:

(24) $\dfrac{Y_i}{\sigma_i} = \dfrac{w_i'}{\sigma_i}\gamma + \dfrac{x_i'\beta(z_i)}{\sigma_i} + \dfrac{u_i}{\sigma_i}.$

We estimate $(\gamma', \beta(z_i)')'$ by the least squares regression of $Y_i/\sigma_i$ on $(w_i'/\sigma_i,\, p^K(x_i, z_i)'/\sigma_i)$.
The transformed error $u_i/\sigma_i$ becomes conditionally homoskedastic. Under the
assumption that $0 < \eta_1 \le \inf_v \sigma^2(v) \le \sup_v \sigma^2(v) \le \eta_2 < \infty$ for some positive
constants $\eta_1 < \eta_2$, by the same arguments as in the proof of Theorem 2.1,
one can show that the resulting estimator of $\gamma$, denoted here by $\tilde\gamma$, satisfies

$\sqrt{n}(\tilde\gamma - \gamma) \to N(0, J_0^{-1} A J_0^{-1}) = N(0, J_0^{-1})$ in distribution,

where

(25) $J_0 = \inf_{x'\xi(z) \in \mathcal{G}} E\{[w_i - x_i'\xi(z_i)][w_i - x_i'\xi(z_i)]'/\sigma^2(v_i)\}$

and

$A = \inf_{x'\xi(z) \in \mathcal{G}} E\{[w_i - x_i'\xi(z_i)][w_i - x_i'\xi(z_i)]'\, u_i^2/\sigma^4(v_i)\}$
$\;\; = \inf_{x'\xi(z) \in \mathcal{G}} E\{[w_i - x_i'\xi(z_i)][w_i - x_i'\xi(z_i)]'/\sigma^2(v_i)\} = J_0.$

Therefore, by the result of Chamberlain (1992), we know that $\tilde\gamma$ is semiparametrically
efficient. Note that if we let $a(x, z) = x'\xi(z) \in \mathcal{G}$ denote the
solution of the minimization problem of (25), that is,
$E\{[w_i - a(x_i, z_i)][w_i - a(x_i, z_i)]'/\sigma^2(v_i)\} = \inf_{x'\xi(z) \in \mathcal{G}} E\{[w_i - x_i'\xi(z_i)][w_i - x_i'\xi(z_i)]'/\sigma^2(v_i)\}$,
then $a(x, z)$, in general, differs from $m(x, z) = E_{\mathcal{G}}(w_i)$ defined in (16) because of the
weighting function $1/\sigma^2(v_i)$.

It is unlikely that $\sigma^2(v_i)$ is known in practice. Let $\hat\sigma^2(v_i)$ denote a generic
nonparametric estimator of $\sigma^2(v_i)$, and write $\hat\sigma_i = \sqrt{\hat\sigma^2(v_i)}$. Then one can obtain
feasible estimators for $\gamma$ and $\beta(z)$ by regressing $Y_i/\hat\sigma_i$ on $[w_i'/\hat\sigma_i,\, p^K(x_i, z_i)'/\hat\sigma_i]$.
The resulting estimator of $\gamma$ will be semiparametrically efficient provided that
$\hat\sigma(v)$ converges to $\sigma(v)$ uniformly with a certain rate for all $v$ in the compact
support of $v$.
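
The feasible two-step procedure described above can be sketched in code. The following is a minimal illustration under stated assumptions rather than the implementation used in this paper: it uses a power-series basis in place of splines, a scalar $z$, and a hypothetical variance model in which $\sigma^2$ depends on $z$ only and is fitted by a low-order polynomial. The function names and the simulated design are illustrative.

```python
import numpy as np

def series_basis(x, z, k):
    """Power-series basis p^K(x, z): each column of x times 1, z, ..., z^(k-1)."""
    powers = np.vander(z, k, increasing=True)            # n x k
    return np.hstack([x[:, [j]] * powers for j in range(x.shape[1])])

def weighted_series_fit(y, w, x, z, k, sigma):
    """Least squares of y/sigma on (w/sigma, p^K/sigma); returns (gamma, alpha)."""
    regressors = np.hstack([w, series_basis(x, z, k)]) / sigma[:, None]
    coef, *_ = np.linalg.lstsq(regressors, y / sigma, rcond=None)
    q = w.shape[1]
    return coef[:q], coef[q:]

def feasible_weighted_fit(y, w, x, z, k, k_var=3):
    """Two-step sketch: unweighted fit, variance regression, weighted refit."""
    n = len(y)
    gamma0, alpha0 = weighted_series_fit(y, w, x, z, k, np.ones(n))
    resid = y - w @ gamma0 - series_basis(x, z, k) @ alpha0
    # Assumed variance model: sigma^2 is a polynomial in z (hypothetical choice).
    var_basis = np.vander(z, k_var, increasing=True)
    var_coef, *_ = np.linalg.lstsq(var_basis, resid ** 2, rcond=None)
    sigma2_hat = np.clip(var_basis @ var_coef, 1e-3, None)  # keep weights bounded
    return weighted_series_fit(y, w, x, z, k, np.sqrt(sigma2_hat))

# Simulated example with true gamma = 0.5 and heteroskedastic errors.
rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(0, 1, n)
x = np.column_stack([np.ones(n), rng.normal(size=n)])
w = (z + rng.normal(size=n)).reshape(-1, 1)
sigma = 0.5 + z
y = (0.5 * w[:, 0] + x[:, 0] * np.sin(2 * z) + x[:, 1] * (1 + z ** 2)
     + sigma * rng.normal(size=n))
gamma_hat, alpha_hat = feasible_weighted_fit(y, w, x, z, k=4)
```

The clipping of the fitted variance mirrors the bounded-variance assumption above; any nonparametric variance estimator with a suitable uniform rate could replace the polynomial regression used here.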

For the kernel-based profile likelihood approach, it is more difficult to
obtain efficient estimation when the error is conditionally heteroskedastic.
Recall that $E_{\mathcal{G}}(A_i)$ denotes the projection of $A_i$ on the varying coefficient
functional space $\mathcal{G}$. From (5) we have

(26) $y_i - E_{\mathcal{G}}(y_i) = (w_i - E_{\mathcal{G}}(w_i))'\gamma + u_i.$

Dividing each term in (26) by $\sigma_i$, we get

(27) $\dfrac{y_i - E_{\mathcal{G}}(y_i)}{\sigma_i} = \dfrac{(w_i - E_{\mathcal{G}}(w_i))'}{\sigma_i}\gamma + \dfrac{u_i}{\sigma_i}.$

Let $\bar\gamma$ denote the least squares estimator of $\gamma$ based on (27). By the
Lindeberg central limit theorem, we have

(28) $\sqrt{n}(\bar\gamma - \gamma) \to N(0, \{E[(w_i - E_{\mathcal{G}}(w_i))(w_i - E_{\mathcal{G}}(w_i))'/\sigma_i^2]\}^{-1})$

in distribution.

However, $\bar\gamma$ is not semiparametrically efficient because

$E[(w_i - E_{\mathcal{G}}(w_i))(w_i - E_{\mathcal{G}}(w_i))'/\sigma_i^2] \ne \inf_{g \in \mathcal{G}} E[(w_i - g(x_i, z_i))(w_i - g(x_i, z_i))'/\sigma^2(v_i)]$

due to the weight function $1/\sigma_i^2$. [$E_{\mathcal{G}}(w_i)$ is defined as the (unweighted)
projection of $w_i$ on the varying coefficient functional space $\mathcal{G}$. It differs from the
weighted projection in general.] We conjecture that some iterative procedure
(similar to the backfitting algorithm) is needed in order to obtain an efficient
kernel-based estimator for $\gamma$ when the error is conditionally heteroskedastic.
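
The difference between the unweighted and the weighted projection can be seen in a small numerical sketch. The example below is only an illustration: a three-term polynomial basis stands in for the space $\mathcal{G}$, and the weights $1/\sigma^2$ are a hypothetical function of $z$. By construction, the weighted least squares fit attains the smaller sample value of the weighted second moment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.uniform(0, 1, n)
basis = np.vander(z, 3, increasing=True)         # stand-in for the space G: 1, z, z^2
w = np.exp(z) + rng.normal(size=n)               # w is not exactly in the span
inv_var = 1.0 / (0.2 + z) ** 2                   # hypothetical weights 1/sigma^2(v)

# Unweighted projection: ordinary least squares of w on the basis.
coef_ols, *_ = np.linalg.lstsq(basis, w, rcond=None)
# Weighted projection: least squares after scaling each row by 1/sigma.
root = np.sqrt(inv_var)
coef_wls, *_ = np.linalg.lstsq(basis * root[:, None], w * root, rcond=None)

def weighted_moment(coef):
    """Sample analogue of E[(w - g)^2 / sigma^2] for g = basis @ coef."""
    return np.mean((w - basis @ coef) ** 2 * inv_var)

moment_unweighted = weighted_moment(coef_ols)
moment_weighted = weighted_moment(coef_wls)
```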

APPENDIX 

Throughout this Appendix, $C$ denotes a generic positive constant that
may be different in different uses, and $\sum_i = \sum_{i=1}^n$. The norm $\|\cdot\|$ for a matrix
$A$ is defined by $\|A\| = [\mathrm{tr}(A'A)]^{1/2}$. Also, when $A$ is a matrix and $a_n$ is a
positive sequence depending on $n$, $A = O_p(a_n)$ [or $o_p(a_n)$] means that each
element of $A$ is $O_p(a_n)$ [or $o_p(a_n)$]. Also, when we write $A \le C$ for a constant
scalar $C$, it means that each element of $A$ is less than or equal to $C$.

Proof of Theorem 2.1. Recall that $\theta(x_i, z_i) = E[w_i \mid x_i, z_i]$, $m(x_i, z_i) =
E_{\mathcal{G}}(w_i) = E_{\mathcal{G}}(\theta(x_i, z_i))$ and $\varepsilon_i = w_i - m(x_i, z_i)$. Define $v_i = w_i - \theta(x_i, z_i)$
and $\eta_i = \theta(x_i, z_i) - m(x_i, z_i)$. We will use the following shorthand notation:
$\theta_i = \theta(x_i, z_i)$, $g_i = x_i'\beta(z_i)$ and $m_i = m(x_i, z_i)$. Hence, $v_i = w_i - \theta_i$,
$\varepsilon_i = \theta_i + v_i - m_i$, $\eta_i = \theta_i - m_i$. Finally, the variables without subscript represent
matrices, for example, $\theta = (\theta_1, \ldots, \theta_n)'$ is of dimension $n \times 1$.

Also recall that for any matrix $A$ with $n$ rows, we define $\tilde A = P(P'P)^{-}P'A$
[$P$ is defined below (6)]. Applying this definition to $\theta, m, g, \eta, u, v$, we get
$\tilde\theta, \tilde m, \tilde g, \tilde\eta, \tilde u, \tilde v$.
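
As a numerical illustration of this projection, the sketch below computes $P(P'P)^{-}P'A$ with the Moore-Penrose inverse as one particular choice of generalized inverse (an assumption of this example), and uses a deliberately rank-deficient $P$. The helper name `project` is illustrative.

```python
import numpy as np

def project(P, A):
    """Return P (P'P)^- P' A, using the Moore-Penrose inverse for (P'P)^-."""
    return P @ np.linalg.pinv(P.T @ P, rcond=1e-10) @ (P.T @ A)

rng = np.random.default_rng(2)
n, K = 50, 4
P = rng.normal(size=(n, K))
P = np.hstack([P, P[:, [0]] + P[:, [1]]])        # redundant column: P'P is singular
A = rng.normal(size=(n, 2))

A_tilde = project(P, A)                          # projection of A on the columns of P
residual = A - A_tilde                           # orthogonal to every column of P
```

Projecting twice returns the same matrix, and the residual is orthogonal to the columns of $P$, which are the two properties used repeatedly in the proofs below.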

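Since $w_i = \theta_i + v_i$ and $\theta_i = m_i + \eta_i$, we get $w_i = \eta_i + v_i + m_i$ and $\tilde w_i =
\tilde\eta_i + \tilde v_i + \tilde m_i$. In matrix notation,

$W = \eta + v + m$ and $\tilde W = \tilde\eta + \tilde v + \tilde m$.

Therefore, we have

(29) $W - \tilde W = \eta + v + (m - \tilde m) - \tilde v - \tilde\eta.$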

For scalars or column vectors $A_i$ and $B_i$, we define $S_{A,B} = n^{-1}\sum_i A_i B_i'$
and $S_A = S_{A,A}$. We also define the scalar function $\bar S_A = n^{-1}\sum_i A_i'A_i$, which
is the sum of the diagonal elements of $S_A$. Using $ab \le (a^2 + b^2)/2$, it is
easy to see that each element of $S_{A,B}$ is less than or equal to $\bar S_A + \bar S_B$. When
we evaluate the probability order of $S_{A,B}$, we often write $S_{A,B} \le \bar S_A + \bar S_B$.
The scalar bound $\bar S_A + \bar S_B$ bounds each of the elements in $S_{A,B}$. Therefore,
if $\bar S_A + \bar S_B = O_p(a_n)$ (for some sequence $a_n$), then each element of
$S_{A,B}$ is at most $O_p(a_n)$, which implies that $S_{A,B} = O_p(a_n)$. Similarly, using
the Cauchy-Schwarz inequality, we have $S_{A,B} \le (\bar S_A \bar S_B)^{1/2}$. Here again, the
scalar bounds all the elements in $S_{A,B}$.
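
The two elementwise bounds can be checked numerically. The sketch below is only an illustration with simulated data; `S` and `S_bar` are illustrative names for the matrix $S_{A,B}$ and the scalar function defined above.

```python
import numpy as np

def S(A, B):
    """S_{A,B} = n^{-1} sum_i A_i B_i' for n x p and n x q arrays of stacked rows."""
    return A.T @ B / A.shape[0]

def S_bar(A):
    """Scalar n^{-1} sum_i A_i'A_i, i.e. the trace of S_{A,A}."""
    return np.trace(S(A, A))

rng = np.random.default_rng(3)
A = rng.normal(size=(200, 3))
B = rng.standard_t(df=5, size=(200, 2))

cross = np.abs(S(A, B))                          # every element of S_{A,B}, in absolute value
sum_bound = S_bar(A) + S_bar(B)                  # bound from ab <= (a^2 + b^2)/2
cs_bound = np.sqrt(S_bar(A) * S_bar(B))          # Cauchy-Schwarz bound
```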

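Note that if $S_{W-\tilde W}^{-1}$ exists, then, from (10) and (11), we get

(30) $\sqrt{n}(\hat\gamma - \gamma) = \Big[n^{-1}\sum_i (w_i - \tilde w_i)(w_i - \tilde w_i)'\Big]^{-1} \times \sqrt{n}\, n^{-1}\sum_i (w_i - \tilde w_i)(g_i - \tilde g_i + u_i - \tilde u_i)$
$\;\; = S_{W-\tilde W}^{-1}\, \sqrt{n}\, S_{W-\tilde W,\, g-\tilde g+u-\tilde u},$

where $g_i = x_i'\beta(z_i)$.

For the first part of the theorem, we will prove the following: (i) $S_{W-\tilde W} =
\Phi + o_p(1)$, (ii) $S_{W-\tilde W,\, g-\tilde g} = o_p(n^{-1/2})$, (iii) $S_{W-\tilde W,\, \tilde u} = o_p(n^{-1/2})$ and (iv) $\sqrt{n}\, S_{W-\tilde W,\, u} \to
N(0, \Omega)$ in distribution.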

PROOF of (i). For a matrix $A$ and scalar sequence $a_n$, $A = O_p(a_n)$
($o_p(a_n)$) means that each element of $A$ has an order of $O_p(a_n)$ ($o_p(a_n)$).
Using (29), we have

(31) $S_{W-\tilde W} = S_{\eta+v+(m-\tilde m)-\tilde v-\tilde\eta} = S_{\eta+v} + S_{(m-\tilde m)-\tilde v-\tilde\eta} + 2S_{\eta+v,\,(m-\tilde m)-\tilde v-\tilde\eta}.$

The first term $S_{\eta+v} = \frac{1}{n}\sum_i (\eta_i + v_i)(\eta_i + v_i)' = \frac{1}{n}\sum_i \varepsilon_i\varepsilon_i' = \Phi + o_p(1)$ by
virtue of the law of large numbers.

The second term $S_{(m-\tilde m)-\tilde v-\tilde\eta} \le 3(\bar S_{m-\tilde m} + \bar S_{\tilde v} + \bar S_{\tilde\eta}) = o_p(1)$ by Lemmas
A.3, A.4(i) and A.5, stated and proved at the end of this Appendix; here $\bar S_A = n^{-1}\sum_i A_i'A_i$ is the scalar bound defined above.

The last term $S_{\eta+v,\,(m-\tilde m)-\tilde v-\tilde\eta} \le \{\bar S_{\eta+v}\, \bar S_{(m-\tilde m)-\tilde v-\tilde\eta}\}^{1/2} = (O_p(1)o_p(1))^{1/2} =
o_p(1)$ by the preceding results, where for an $m \times m$ matrix $A$, $\mathrm{Diag}(A)$ is
an $m \times 1$ matrix with the diagonal elements of $A$, and $A^{1/2}$ has the same
dimension as $A$ by taking the square root for each element of $A$. □



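PROOF of (ii). Using (29), we have

(32) $S_{W-\tilde W,\, g-\tilde g} = S_{\eta+v+(m-\tilde m)-\tilde v-\tilde\eta,\, g-\tilde g} = S_{\eta+v,\, g-\tilde g} + S_{m-\tilde m,\, g-\tilde g} - S_{\tilde v,\, g-\tilde g} - S_{\tilde\eta,\, g-\tilde g}.$

For the first term, by noting that $\eta_i + v_i$ is orthogonal to the varying
coefficient functional space $\mathcal{G}$, and $g_i - \tilde g_i$ belongs to $\mathcal{G}$, we have, using
Lemma A.3, $E[\|S_{\eta+v,\, g-\tilde g}\|^2] = n^{-2}\sum_{i=1}^n E[\|\eta_i + v_i\|^2 (g_i - \tilde g_i)^2] \le
Cn^{-1}\big(\sum_{l=1}^d k_l^{-2\delta_l}\big) \times E[\|\eta_i + v_i\|^2] = O\big(n^{-1}\sum_{l=1}^d k_l^{-2\delta_l}\big) = o(n^{-1})$, which implies
that $S_{\eta+v,\, g-\tilde g} = O_p\big(n^{-1/2}\sum_{l=1}^d k_l^{-\delta_l}\big)$.

The second term $S_{m-\tilde m,\, g-\tilde g} \le (\bar S_{m-\tilde m}\, \bar S_{g-\tilde g})^{1/2} = O_p\big(\sum_{l=1}^d k_l^{-2\delta_l}\big)$ by Lemma A.3.

The third term $S_{\tilde v,\, g-\tilde g} \le (\bar S_{\tilde v}\, \bar S_{g-\tilde g})^{1/2} = O_p((K/n)^{1/2})\, O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big)$ by
Lemmas A.3 and A.4(i). The last term $S_{\tilde\eta,\, g-\tilde g} \le (\bar S_{\tilde\eta}\, \bar S_{g-\tilde g})^{1/2} = O_p((K/n)^{1/2}) \times
O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big)$ by Lemmas A.3 and A.5.

Combining the above four terms we have $S_{W-\tilde W,\, g-\tilde g} = O_p\big((n^{-1/2} + (K/n)^{1/2})\sum_{l=1}^d k_l^{-\delta_l} +
\sum_{l=1}^d k_l^{-2\delta_l}\big) = o_p(n^{-1/2})$ by Assumption 2.3. □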



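PROOF of (iii). Using (29), we have

(33) $S_{W-\tilde W,\, \tilde u} = S_{\eta+v+(m-\tilde m)-\tilde v-\tilde\eta,\, \tilde u} = S_{\eta+v,\, \tilde u} + S_{m-\tilde m,\, \tilde u} - S_{\tilde v,\, \tilde u} - S_{\tilde\eta,\, \tilde u}.$

The first term $S_{\eta+v,\, \tilde u} \le \{\bar S_{\tilde\eta+\tilde v}\, \bar S_{\tilde u}\}^{1/2} = O_p(K/n)$ by Lemma A.4(ii). The
second term $S_{m-\tilde m,\, \tilde u} \le \{\bar S_{m-\tilde m}\, \bar S_{\tilde u}\}^{1/2} = O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big)\, O_p(\sqrt{K}/\sqrt{n})$ by Lemmas
A.3 and A.4(ii).

The third term $S_{\tilde v,\, \tilde u} \le (\bar S_{\tilde v}\, \bar S_{\tilde u})^{1/2} = O_p(K/n)$ by Lemma A.4(i), (ii). The
last term $S_{\tilde\eta,\, \tilde u} \le (\bar S_{\tilde\eta}\, \bar S_{\tilde u})^{1/2} = O_p(K/n)$ by Lemmas A.4(ii) and A.5.

Combining all four terms, we get $S_{W-\tilde W,\, \tilde u} = O_p\big(K/n + n^{-1/2}\sum_{l=1}^d k_l^{-\delta_l}\big) =
o_p(n^{-1/2})$ by Assumption 2.3. □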



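PROOF of (iv). Using (29), we have

(34) $S_{W-\tilde W,\, u} = S_{\eta+v+(m-\tilde m)-\tilde v-\tilde\eta,\, u} = S_{\eta+v,\, u} + S_{m-\tilde m,\, u} - S_{\tilde v,\, u} - S_{\tilde\eta,\, u}.$

The first term $\sqrt{n}\, S_{\eta+v,\, u} = \sqrt{n}\, n^{-1}\sum_{i=1}^n (\eta_i + v_i)u_i = n^{-1/2}\sum_{i=1}^n \varepsilon_i u_i \to N(0, \Omega)$
in distribution by the Lindeberg-Feller central limit theorem.

The second term $E[S^2_{m-\tilde m,\, u} \mid X, Z] = \frac{1}{n^2}\mathrm{tr}\{(m - \tilde m)(m - \tilde m)' E[uu' \mid X, Z]\} \le
(C/n)\, \mathrm{tr}[(m - \tilde m)'(m - \tilde m)/n] = (C/n)\bar S_{m-\tilde m} = o_p(n^{-1})$ by Lemma A.3. Hence,
$S_{m-\tilde m,\, u} = o_p(n^{-1/2})$.

The third term $E[S^2_{\tilde v,\, u} \mid X, Z] = \frac{1}{n^2}\mathrm{tr}(P(P'P)^{-1}P'vv'P(P'P)^{-1}P' E[uu' \mid X, Z]) \le
(C/n^2)\, \mathrm{tr}[P(P'P)^{-1}P'vv'P(P'P)^{-1}P'] = (C/n)\, \mathrm{tr}(\tilde v\tilde v'/n) = (C/n)\bar S_{\tilde v} = o_p(n^{-1})$
by Lemma A.4(i). Hence, $S_{\tilde v,\, u} = o_p(n^{-1/2})$.

The last term $S_{\tilde\eta,\, u} = o_p(n^{-1/2})$ by the same proof as $S_{\tilde v,\, u} = o_p(n^{-1/2})$, by
citing Lemma A.5 rather than citing Lemma A.4(i). □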

Combining proofs of (i)-(iv) with (30), we conclude that $\sqrt{n}(\hat\gamma - \gamma) \to
N(0, \Phi^{-1}\Omega\Phi^{-1})$ in distribution.

For the second part of the theorem, we need to show that $\hat\Sigma = \Sigma + o_p(1)$,
where $\Sigma = \Phi^{-1}\Omega\Phi^{-1}$. But $\hat\Phi = S_{W-\tilde W} = \Phi + o_p(1)$ is proved in the proof
of (i) above. By a similar argument, it is easy to show that $\hat\Omega = \Omega + o_p(1)$.
Therefore, $\hat\Sigma = \Sigma + o_p(1)$. □
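
A minimal sketch of how the sandwich estimator $\hat\Sigma = \hat\Phi^{-1}\hat\Omega\hat\Phi^{-1}$ might be computed is given below. It takes $\hat\Phi = S_{W-\tilde W}$ as in the proof, and it assumes the usual heteroskedasticity-robust form $\hat\Omega = n^{-1}\sum_i (w_i - \tilde w_i)(w_i - \tilde w_i)'\hat u_i^2$, which is an assumption of this example. The function name and the simulated design are illustrative.

```python
import numpy as np

def sandwich_variance(W, P, resid):
    """Sigma_hat = Phi_hat^{-1} Omega_hat Phi_hat^{-1} for the series estimator of gamma."""
    n = W.shape[0]
    W_tilde = P @ np.linalg.lstsq(P, W, rcond=None)[0]   # projection of W on columns of P
    E = W - W_tilde
    phi_hat = E.T @ E / n                                # S_{W - W~}
    omega_hat = (E * resid[:, None] ** 2).T @ E / n      # assumed robust form of Omega_hat
    phi_inv = np.linalg.inv(phi_hat)
    return phi_inv @ omega_hat @ phi_inv

# Simulated example: homoskedastic unit-variance errors, so Sigma is close to 1.
rng = np.random.default_rng(6)
n = 4000
z = rng.uniform(0, 1, n)
W = (z + rng.normal(size=n)).reshape(-1, 1)
P = np.vander(z, 4, increasing=True)                     # series terms with scalar x_i = 1
y = 0.5 * W[:, 0] + np.sin(2 * z) + rng.normal(size=n)

design = np.hstack([W, P])
coef = np.linalg.lstsq(design, y, rcond=None)[0]
resid = y - design @ coef
sigma_hat = sandwich_variance(W, P, resid)
std_err = np.sqrt(np.diag(sigma_hat) / n)                # standard error of gamma_hat
```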

Proof of Theorem 2.2. We will prove Theorem 2.2 by replacing
$\hat\beta(z)$ and $\beta(z)$ by $\hat g(x, z) = x'\hat\beta(z)$ and $g(x, z) = x'\beta(z)$, respectively, because
$|\hat g(x, z) - g(x, z)|^2 = |x'(\hat\beta(z) - \beta(z))|^2 \le d\sum_{l=1}^d x_l^2(\hat\beta_l(z) - \beta_l(z))^2$, which has
the same order as $\|\hat\beta(z) - \beta(z)\|^2$ under the bounded support assumption.
Hence, the rate of convergence for $\hat g(x, z) - g(x, z)$ is the same as that of
$\hat\beta(z) - \beta(z)$.

The proof is similar to the proof of Theorem 1 in Newey (1997). Define an
indicator function $1_n$ which equals 1 if $(P'P)$ is nonsingular and 0 otherwise.
We first find the convergence rate of $1_n\|\hat\alpha - \alpha\|$. By (12) and (7), and if
$(P'P)^{-1}$ exists, we have

$\hat\alpha = (P'P)^{-1}P'(Y - W\hat\gamma)$

$\;\; = (P'P)^{-1}P'(Y - W\gamma - W(\hat\gamma - \gamma))$

(35) $\;\; = (P'P)^{-1}P'(P\alpha + (G - P\alpha) + u - W(\hat\gamma - \gamma))$

$\;\; = \alpha + (P'P/n)^{-1}P'(G - P\alpha)/n + (P'P/n)^{-1}P'u/n - (P'P/n)^{-1}P'W(\hat\gamma - \gamma)/n.$

Hence,

(36) $1_n\|\hat\alpha - \alpha\| \le 1_n\|(P'P/n)^{-1}P'(G - P\alpha)/n\| + 1_n\|(P'P/n)^{-1}P'u/n\| + 1_n\|(P'P/n)^{-1}P'W(\hat\gamma - \gamma)/n\|.$

The first term $1_n\|(P'P/n)^{-1}P'(G - P\alpha)/n\| = O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big)$ by Lemma A.2.

The second term

$E[1_n\|(P'P/n)^{-1}P'u/n\| \mid X, Z]$

$\;\; = 1_n E[((u'P/n)(P'P/n)^{-1}(P'P/n)^{-1}(P'u/n))^{1/2} \mid X, Z]$

$\;\; \le O_p(1)\, 1_n \{\mathrm{tr}(P(P'P)^{-1}P' E[uu' \mid X, Z]/n)\}^{1/2}$

$\;\; \le O_p(1)\, 1_n C\sqrt{K}/\sqrt{n}$

by Lemma A.1 and Assumption 2.1. Hence, $1_n\|(P'P/n)^{-1}P'u/n\| = O_p(\sqrt{K}/\sqrt{n})$.

As for the last term, note that $W = \eta + v + m = \varepsilon + m$ and $\hat\gamma - \gamma =
O_p(n^{-1/2})$ by Theorem 2.1. Therefore,

$E[1_n\|(P'P/n)^{-1}P'W/n\| \mid X, Z]$

$\;\; = 1_n E[\|(P'P/n)^{-1}P'(\varepsilon + m)/n\| \mid X, Z]$

$\;\; \le 1_n E[\|(P'P/n)^{-1}P'\varepsilon/n\| \mid X, Z] + 1_n E[\|(P'P/n)^{-1}P'm/n\| \mid X, Z].$

Also,

$1_n E[\|(P'P/n)^{-1}P'\varepsilon/n\| \mid X, Z]$

$\;\; = 1_n E[((\varepsilon'P/n)(P'P/n)^{-1}(P'P/n)^{-1}(P'\varepsilon/n))^{1/2} \mid X, Z]$

$\;\; \le O_p(1)\, 1_n \{\mathrm{tr}(P(P'P)^{-1}P' E[\varepsilon\varepsilon' \mid X, Z]/n)\}^{1/2}$

$\;\; \le O_p(1)\, 1_n C\sqrt{K}/\sqrt{n}$

by Lemma A.1 as in the proof of Theorem 2.1. Hence, $1_n\|(P'P/n)^{-1}P'\varepsilon/n\| =
O_p(\sqrt{K}/\sqrt{n}) = o_p(1)$.

$1_n\|(P'P)^{-1}P'm\| = 1_n\|(P'P/n)^{-1}P'm/n\| = O_p(1)$ by Lemma A.2.
Combining the above results, also noting that $1_n \to 1$ almost surely, we
have

(37) $\|\hat\alpha - \alpha\| = O_p\big(\sum_{l=1}^d k_l^{-\delta_l} + \sqrt{K}/\sqrt{n}\big).$

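To prove part (i) of Theorem 2.2, using (37) and Assumption 2.3, and
also noting that $\hat g(x, z) = x'\hat\beta(z) = p^K(x, z)'\hat\alpha$, we have

$\sup_{(x,z) \in \mathcal{S}} |\hat g(x, z) - g(x, z)| \le \sup_{(x,z) \in \mathcal{S}} |p^K(x, z)'(\hat\alpha - \alpha)| + \sup_{(x,z) \in \mathcal{S}} |p^K(x, z)'\alpha - g(x, z)|$
$\;\; \le \zeta_0(K)\|\hat\alpha - \alpha\| + O\big(\sum_{l=1}^d k_l^{-\delta_l}\big).$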

Proofs for (ii) and (iii) are similar, and we only prove (ii),

$n^{-1}\sum_{i=1}^n [\hat g(x_i, z_i) - g(x_i, z_i)]^2 = n^{-1}\|P\hat\alpha - G\|^2$

$\;\; \le 2n^{-1}\{\|P(\hat\alpha - \alpha)\|^2 + \|P\alpha - G\|^2\}$

$\;\; \le 2(\hat\alpha - \alpha)'(P'P/n)(\hat\alpha - \alpha) + 2\sup_{(x,z) \in \mathcal{S}} [p^K(x, z)'\alpha - g(x, z)]^2$

$\;\; = O_p\big(K/n + \sum_{l=1}^d k_l^{-2\delta_l}\big)$

by (37), Lemma A.1 and Assumption 2.3(i). Thus, we have proved Theorem
2.2. □

We now present some lemmas that are used in the proofs of Theorems 2.1
and 2.2. We will omit the indicator function $1_n$ below since $\mathrm{Prob}(1_n = 1) \to 1$
almost surely. Following the arguments in Newey (1997), we can assume
without loss of generality that $B = I$ ($B$ is defined in Assumption 2.2).
Hence, $P^K(x, z) = p^K(x, z)$, and $Q = E[p^K(x_i, z_i)p^K(x_i, z_i)'] = I$ ($I$ is an
identity matrix of dimension $K$); see Newey (1997) for the reasons and more
discussion of these issues. Recall that $p^K(x, z)$ is a $K \times 1$ matrix and rewrite
each component of this matrix as $p^K(x, z) = (p_{1K}(x, z), \ldots, p_{KK}(x, z))'$.

Lemma A.1. $\|\hat Q - I\| = O_p(\zeta_0(K)\sqrt{K}/\sqrt{n}) = o_p(1)$, where $\hat Q = P'P/n$.

PROOF. This is Theorem 1 in Newey (1997). □
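
The content of Lemma A.1 can be illustrated by simulation. The sketch below is only an illustration under assumptions of our own choosing: $z$ is uniform on $[0, 1]$, $x_i = 1$, and the series terms are shifted Legendre polynomials, which are orthonormal under that distribution so that $Q = I$ holds exactly.

```python
import numpy as np
from numpy.polynomial import legendre

def orthonormal_basis(z, K):
    """Shifted Legendre polynomials, orthonormal under the uniform law on [0, 1]."""
    raw = legendre.legvander(2.0 * z - 1.0, K - 1)       # n x K, columns P_0..P_{K-1}
    return raw * np.sqrt(2.0 * np.arange(K) + 1.0)

def q_hat_error(n, K, rng):
    """Frobenius norm of P'P/n - I for a sample of size n."""
    P = orthonormal_basis(rng.uniform(0.0, 1.0, n), K)
    return np.linalg.norm(P.T @ P / n - np.eye(K))

rng = np.random.default_rng(4)
errors = {n: q_hat_error(n, K=4, rng=rng) for n in (200, 20000)}
```

With $K$ fixed, the error shrinks roughly at the rate $n^{-1/2}$, in line with the bound in the lemma.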

Lemma A.2. $\|\tilde\alpha_f - \alpha_f\| = O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big)$, where $\tilde\alpha_f = (P'P)^{-1}P'f$, $\alpha_f$
satisfies Assumption 2.3 and $f = G$ or $f = m$.

Proof. By Lemma A.1, Assumption 2.3 and the fact that $P(P'P)^{-1}P'$
is idempotent,

$\|\tilde\alpha_f - \alpha_f\| = \|(P'P)^{-1}P'(f - P\alpha_f)\|$

$\;\; = \|(f - P\alpha_f)'P(P'P)^{-1}\hat Q^{-1}P'(f - P\alpha_f)/n\|^{1/2}$

$\;\; \le O_p(1)\|(f - P\alpha_f)'P(P'P)^{-1}P'(f - P\alpha_f)/n\|^{1/2}$

$\;\; \le O_p(1)\|(f - P\alpha_f)'(f - P\alpha_f)/n\|^{1/2} = O_p\big(\sum_{l=1}^d k_l^{-\delta_l}\big).$ □

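Lemma A.3. $\bar S_{f-\tilde f} = O_p\big(\sum_{l=1}^d k_l^{-2\delta_l}\big)$, where $f = G$ or $f = m$.

Proof. Note that $\tilde f = P\tilde\alpha_f$. By Assumption 2.3 and Lemmas A.1 and
A.2,

$\bar S_{f-\tilde f} = \frac{1}{n}\|f - \tilde f\|^2 \le \frac{2}{n}\big(\|f - P\alpha_f\|^2 + \|P(\alpha_f - \tilde\alpha_f)\|^2\big)$

$\;\; = O\big(\sum_{l=1}^d k_l^{-2\delta_l}\big) + 2(\alpha_f - \tilde\alpha_f)'(P'P/n)(\alpha_f - \tilde\alpha_f)$

$\;\; \le O\big(\sum_{l=1}^d k_l^{-2\delta_l}\big) + O_p(1)\|\tilde\alpha_f - \alpha_f\|^2 = O_p\big(\sum_{l=1}^d k_l^{-2\delta_l}\big).$ □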



Lemma A.4. (i) $\bar S_{\tilde v} = O_p(K/n)$, (ii) $\bar S_{\tilde u} = O_p(K/n)$.

Proof. (i) This proof is similar to the proof of Theorem 1 of Newey
(1997),

$E[\bar S_{\tilde v} \mid X, Z] = \frac{1}{n}E[v'P(P'P)^{-1}P'v \mid X, Z]$

$\;\; = \frac{1}{n}\mathrm{tr}(P(P'P)^{-1}P' E[vv' \mid X, Z])$

$\;\; \le \frac{C}{n}\mathrm{tr}(P(P'P)^{-1}P') = C\Big(\frac{K}{n}\Big).$

Hence, $\bar S_{\tilde v} = O_p(K/n)$.

(ii) follows as in the proof of Lemma A.4(i). □
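
The key step in this proof, that the trace of the projection matrix equals $K$, can be checked numerically. The sketch below is an illustration only: it assumes a polynomial basis in a uniform $z$ and homoskedastic unit-variance errors, in which case the conditional mean of $n^{-1}\tilde v'\tilde v$ is exactly $K/n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 400, 4
P = np.vander(rng.uniform(0, 1, n), K, increasing=True)
M = P @ np.linalg.solve(P.T @ P, P.T)            # projection matrix P (P'P)^{-1} P'

trace_M = np.trace(M)                            # equals K up to rounding error

# Monte Carlo average of n^{-1} v~'v~ with v~ = M v and unit-variance errors v.
draws = rng.normal(size=(n, 2000))
s_bar = np.sum((M @ draws) ** 2, axis=0) / n
mc_mean = s_bar.mean()                           # close to K / n
```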

Lemma A.5. $\bar S_{\tilde\eta} = O_p(K/n)$.

PROOF. First we show that $(P'\eta/n) = O_p(\sqrt{K}/\sqrt{n})$. Recall that $\theta(x_i, z_i) =
E(w_i \mid x_i, z_i)$ and $\eta_i = \theta(x_i, z_i) - E_{\mathcal{G}}[\theta(x_i, z_i)]$. Note that $p^K(x_i, z_i) \in \mathcal{G}$ and
$E_{\mathcal{G}}(\eta_i) = 0$ (i.e., $\eta \perp \mathcal{G}$). Hence, $E\|P'\eta/n\|^2 = n^{-2}\sum_i E[p^K(x_i, z_i)'\|\eta_i\|^2 p^K(x_i, z_i)] \le
\frac{C}{n}E[p^K(x_i, z_i)'p^K(x_i, z_i)] = \frac{C}{n}\mathrm{tr}\{E[p^K(x_i, z_i)p^K(x_i, z_i)']\} = (CK/n) = O(K/n)$, which
implies that $(P'\eta/n) = O_p(\sqrt{K}/\sqrt{n})$.

Thus, $\bar S_{\tilde\eta} = n^{-1}\tilde\eta'\tilde\eta = (\eta'P/n)(P'P/n)^{-1}(P'\eta/n) = O_p(K/n)O_p(1) = O_p(K/n)$
by Lemma A.1 and the fact that $P'\eta/n = O_p(\sqrt{K}/\sqrt{n})$ as shown above. □